Read Input Data

Input data has be manually archived on the The Wayback Machine is a digital archive of the World Wide Web run by the Internet Archive, a nonprofit organization. Using the wayback package, “memento” files can be retrieved from the internet and scraped by the readr package into tibble data frames.

Read market data

First, we will read prediction market data courtesy of PredictIt, an exchange owned and operated by the Victoria University of Wellington. As part of their operating agreement with the Commodity Futures Trading Commission, PredicIt provides market history data for free to academic researchers.

The data was provided via email as a tab-separated file and can be loaded with readr. Two separate files were sent with the data on the Maine 2nd and New York 27th congressional districts, which were accidentally left out from the the main file. All data can be found in the /data folder.

Read member data

Congressional member data is used to provide party information as well as ideology and leadership scores. The data comes from the [the @unitedstates project]05 and GovTrack.

Read model data

Forecasting model data is courtesy of FiveThirtyEight, who provides the top-level output of their proprietary model for free to the public.

Read election results data

Election results data is courtesy of FiveThirtyEight and their parent company ABC News, whose Decision Desk called outcomes of races on election night.

This data is used to assess the accuracy of each predictive method.

Format Data for Comparison

Once data is collected from the Internet Archive, each tibble will need to be formatted in a similar style. This will be done using tidyverse data manipulation tools.

Ultimately, each tibble will need similar date and race variables, which together can be used to perform relational joins for comparison. Using all 4 primary data sets, we can create a tibble for each predictive method with all the data needed for comparison.

Format member data

members <- legislators_current %>%
  unite(first_name, last_name,
        col = name,
        sep = " ") %>%
  rename(gid     = govtrack_id,
         chamber = type,
         class   = senate_class,
         birth   = birthday) %>%
  select(name, gid, birth, state, district, class, party, gender, chamber) %>%
  arrange(chamber)

members$name     %<>% iconv(to = "ASCII//TRANSLIT")
members$name     %<>% str_replace_all("Robert Menendez", "Bob Menendez")
members$name     %<>% str_replace_all("Robert Casey",    "Bob Casey")
members$name     %<>% str_replace_all("Bernard Sanders", "Bernie Sanders")
members$chamber  %<>% recode("rep" = "house", "sen" = "senate")
members$district %<>% str_pad(width = 2, pad = "0")
members$class    %<>% str_pad(width = 2, pad = "S")
members$party    %<>% recode("Democrat"    = "D",
                             "Independent" = "D",
                             "Republican"  = "R")

members$district <- if_else(condition = is.na(members$district),
                            true = members$class,
                            false = members$district)

# Create district code as relational key
members %<>%
  unite(col = race,
        state, district,
        sep = "-",
        remove = TRUE) %>%
  select(-class) %>%
  arrange(name)

# Format member stats for join
members_stats <-
  bind_rows(sponsorshipanalysis_h, sponsorshipanalysis_s,
            .id = "chamber") %>%
  select(ID, chamber, party, ideology, leadership) %>%
  rename(gid = ID)
members_stats$chamber %<>% recode("1" = "house", "2" = "senate")
members_stats$party %<>% recode("Democrat"    = "D",
                                "Independent" = "D",
                                "Republican"  = "R")
members_stats$gid %<>% as.character()
# Add stats to frame by GovTrack ID
members %<>% inner_join(members_stats, by = c("gid", "party", "chamber"))

members

Format market data

markets <- DailyMarketData %>%
  rename(mid      = MarketId,
         name     = MarketName,
         symbol   = MarketSymbol,
         party    = ContractName,
         open     = OpenPrice,
         close    = ClosePrice,
         high     = HighPrice,
         low      = LowPrice,
         volume   = Volume,
         date     = Date) %>%
  select(date, everything()) %>%
  select(-ContractSymbol)

# Get candidate names from full market question
markets$name[str_which(markets$name, "Which party will")] <- NA
markets$name %<>% word(start = 2, end = 3)

# Recode party variables
markets$party %<>% recode("Democratic or DFL" = "D",
                          "Democratic"        = "D",
                          "Republican"        = "R")

# Remove year information from symbol strings
markets$symbol %<>% str_remove(".2018")
markets$symbol %<>% str_remove(".18")

# Divide the market symbol into the name and race code
markets %<>%
  separate(col = symbol,
           into = c("symbol", "race"),
           sep = "\\.",
           extra = "drop",
           fill = "left") %>%
  select(-symbol)

# Recode the original contract strings for race variables
markets$race %<>% str_replace("SENATE", "S1")
markets$race %<>% str_replace("SEN",    "S1")
markets$race %<>% str_replace("SE",     "S1")
markets$race %<>% str_replace("AL",     "01")   # at large
markets$race %<>% str_replace("OH12G",  "OH12") # not sure
markets$race %<>% str_replace("MN99",   "MNS2") # special election
markets$race[markets$name == "SPEC"] <- "MSS2"  # special election
markets$race[markets$mid  == "3857"] <- "CAS1"  # market name mustyped
markets$name[markets$name == "PARTY"] <- NA     # no name
markets$name[markets$name == "SPEC"]  <- NA     # no name

markets$race <- paste(str_sub(markets$race, 1, 2), # state abbreviation
                      sep = "-",                   # put hyphen in middle
                      str_sub(markets$race, 3, 4)) # market number)

# Remove markets incorectly repeated
# Some not running for re-election
markets %<>% filter(mid != "3455", # Paul Ryan
                    mid != "3507", # Jeff Flake
                    mid != "3539", # Shea-Porter
                    mid != "3521", # Darrell Issa
                    mid != "3522", # Repeat of 4825
                    mid != "4177", # Repeat of 4232
                    mid != "4824") # Repeat of 4776

# Divide the data based on market question syntax
# Market questions provided name or party, never both
markets_with_name <- markets %>%
  filter(is.na(party)) %>%
  select(-party)

markets_with_party <- markets %>%
  filter(is.na(name)) %>%
  select(-name)

# Join with members key to add party, then back with rest of market
markets <- markets_with_name %>%
  inner_join(members, by = c("name", "race")) %>%
  select(date, mid, race, party, open, low, high, close, volume) %>%
  bind_rows(markets_with_party)

# Add in ME-02 and NY-27 which were left out of initial data
ny_27 <- Contract_NY27 %>%
  rename_all(tolower) %>%
  slice(6:154) %>%
  mutate(mid = "4729",
         race = "NY-27",
         party = "R") %>%
  select(-average)

me_02 <- Market_ME02 %>%
  rename_all(tolower) %>%
  rename(party = longname) %>%
  filter(date != "2018-10-10") %>%
  mutate(mid = "4945",
         race = "ME-02")

markets_extra <-
  bind_rows(ny_27, me_02) %>%
  select(date, mid, race, party, open, low, high, close, volume)

markets_extra$party[str_which(markets_extra$party, "GOP")] <- "R"
markets_extra$party[str_which(markets_extra$party, "Dem")] <- "D"

# Bind with ME-02 and NY-27
markets %<>%  bind_rows(markets_extra)

markets

Format model data

# Format district for race variable
model_district <- house_district_forecast %>%
  mutate(district = str_pad(string = district,
                            width = 2,
                            side = "left",
                            pad = "0"))

# Format class for race variable
model_seat <- senate_seat_forecast %>%
  rename(district = class) %>%
  mutate(district = str_pad(string = district,
                            width = 2,
                            side = "left",
                            pad = "S"))

model_combined <-
  bind_rows(model_district, model_seat, .id = "chamber") %>%
  # Create race variable for relational join
  unite(col = race,
        state, district,
        sep = "-",
        remove = TRUE) %>%
  rename(name = candidate,
         date = forecastdate,
         prob = win_probability,
         min_share = p10_voteshare,
         max_share = p90_voteshare) %>%
  filter(name != "Others") %>%
  select(date, race, name, party, chamber, everything()) %>%
  arrange(date, name)

# Recode identifying variable for clarification
model_combined$chamber %<>% recode("1" = "house",
                                   "2" = "senate")

# Only special elections are for senate.
model_combined$special[is.na(model_combined$special)] <- FALSE

# Convert percent vote share values to decimal
model_combined[, 10:12] <- model_combined[, 10:12] * 0.01

# Recode incumbent Independent senators for relational joins with Markets
# Both caucus with Democrats and were endoresed by Democratic party
model_combined$party[model_combined$name == "Bernard Sanders"]   <- "D"
model_combined$party[model_combined$name == "Angus S. King Jr."] <- "D"
model_combined %<>% filter(name != "Zak Ringelstein")

# Seperate model data by model format
# According to 538, the "classic" model can be used as a default
model <- model_combined %>% 
  filter(model == "classic") %>% 
  select(-model)

model_lite <- model_combined %>% 
  filter(model == "lite") %>% 
  select(-model)

model_deluxe <- model_combined %>% 
  filter(model == "deluxe") %>% 
  select(-model)

model

Format election results

Compare Predictive Methods

Once each data frame has been properly formatted, they can be filtered to remove redundant predictions. Each row in both sets will contain the day’s probability of a Democratic party candidate winning.

## 
##  Welch Two Sample t-test
## 
## data:  hit by method
## t = 4.1209, df = 17433, p-value = 1.895e-05
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.01338999        Inf
## sample estimates:
## mean in group market  mean in group model 
##            0.8603429            0.8380571
## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  . out of nrow(hits)/2 %>% rep(2)
## X-squared = 16.794, df = 1, p-value = 4.166e-05
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.01157269 0.03299874
## sample estimates:
##    prop 1    prop 2 
## 0.8603429 0.8380571
## 
##  Welch Two Sample t-test
## 
## data:  brier_score by method
## t = -0.33902, df = 16943, p-value = 0.7346
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.005016567  0.003537138
## sample estimates:
## mean in group market  mean in group model 
##            0.1083634            0.1091031

Explore Data Visually